TranSTYLer: Multimodal Behavioral Style Transfer for Facial and Body Gestures Generation
This paper addresses the challenge of transferring the behavior expressivity style of a virtual agent to another one while preserving the shape of behaviors, as they carry communicative meaning. Behavior expressivity style is viewed here as the qualitative properties of behaviors. We propose TranSTYLer, a multimodal transformer-based model that synthesizes the multimodal behaviors of a source speaker with the style of a target speaker. We assume that behavior expressivity style is encoded across various modalities of communication, including text, speech, body gestures, and facial expressions. The model employs a style and content disentanglement scheme to ensure that the transferred style does not interfere with the meaning conveyed by the source behaviors. Our approach eliminates the need for style labels and allows generalization to styles that have not been seen during the training phase. We train our model on the PATS corpus, which we extended to include dialog acts and 2D facial landmarks. Objective and subjective evaluations show that our model outperforms state-of-the-art models in style transfer for both seen and unseen styles during training. To tackle the issues of style and content leakage that may arise, we propose a methodology to assess the degree to which behaviors and gestures associated with the target style are successfully transferred, while ensuring the preservation of those related to the source content.
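A minimal sketch of the disentanglement idea described above, assuming PyTorch; the module names (ContentEncoder, StyleEncoder, Decoder) and all dimensions are illustrative stand-ins, not the paper's actual architecture. Separate encoders produce a content sequence and a fixed-size style vector, and a decoder synthesizes behaviors from source content combined with target style:

```python
# Sketch of style/content disentanglement for behavior style transfer.
# Assumes PyTorch; all module names and dimensions are illustrative.
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    """Encodes multimodal source features into a content sequence."""
    def __init__(self, feat_dim=128, d_model=256, n_layers=4, n_heads=4):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, x):                      # x: (batch, time, feat_dim)
        return self.encoder(self.proj(x))      # (batch, time, d_model)

class StyleEncoder(nn.Module):
    """Pools target-speaker features into one fixed-size style vector."""
    def __init__(self, feat_dim=128, style_dim=64):
        super().__init__()
        self.net = nn.GRU(feat_dim, style_dim, batch_first=True)

    def forward(self, x):                      # x: (batch, time, feat_dim)
        _, h = self.net(x)
        return h[-1]                           # (batch, style_dim)

class Decoder(nn.Module):
    """Generates behaviors from source content conditioned on target style."""
    def __init__(self, d_model=256, style_dim=64, out_dim=96):
        super().__init__()
        self.out = nn.Linear(d_model + style_dim, out_dim)

    def forward(self, content, style):
        style = style.unsqueeze(1).expand(-1, content.size(1), -1)
        return self.out(torch.cat([content, style], dim=-1))

# Transfer: source content + target style -> stylized behaviors.
content = ContentEncoder()(torch.randn(2, 100, 128))
style = StyleEncoder()(torch.randn(2, 200, 128))
behaviors = Decoder()(content, style)          # (2, 100, 96)
```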
Design and Evaluation of Shared Prosodic Annotation for Spontaneous French Speech: From Expert Knowledge to Non-Expert Annotation
In the area of large French speech corpora, there is a demonstrated need for a common prosodic notation system allowing for easy data exchange, comparison, and automatic annotation. The major questions are: (1) how to develop a single simple scheme of prosodic transcription which could form the basis of guidelines for non-expert manual annotation (NEMA), used for linguistic teaching and research; (2) based on this NEMA, how to establish reference prosodic corpora (RPC) for different discourse genres (Cresti and Moneglia, 2005); (3) how to use the RPC to develop corpus-based learning methods for automatic prosodic labelling in spontaneous speech (Buhmann et al., 2002; Tamburini and Caini, 2005; Avanzi et al., 2010). This paper presents two pilot experiments conducted with a consortium of 15 French experts in prosody in order to provide a prosodic transcription framework (transcription methodology and transcription reliability measures) and to establish reference prosodic corpora in French.
Stylization and Trajectory Modelling of Short and Long Term Speech Prosody Variations
In this paper, a unified trajectory model based on the stylization and modelling of f0 variations simultaneously over various temporal domains is proposed. The syllable is used as the minimal temporal domain for the description of speech prosody, and short-term and long-term f0 variations are stylized and modelled jointly over various temporal domains. During training, a context-dependent model is estimated from the jointly stylized f0 contours over the syllable and a set of long-term temporal domains. During synthesis, f0 variations are determined using the long-term variations as trajectory constraints. In a subjective evaluation in speech synthesis, the stylization and trajectory modelling of short- and long-term speech prosody variations is shown to consistently model speech prosody and to outperform conventional short-term modelling.
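A sketch of what the per-syllable stylization step can look like, assuming syllable-aligned f0: each syllable's contour is reduced to a few coefficients over a normalized time axis. The low-order polynomial fit and the log-f0 domain are illustrative assumptions, not the paper's exact parameterization:

```python
# Sketch of per-syllable f0 stylization: each syllable's contour is
# reduced to low-order polynomial coefficients on a normalized time axis.
# numpy only; the polynomial order and log-f0 domain are assumptions.
import numpy as np

def stylize_syllable(f0_hz, order=2):
    """Fit log-f0 over one syllable; returns the coefficient vector."""
    f0 = np.log(np.asarray(f0_hz, dtype=float))
    t = np.linspace(0.0, 1.0, len(f0))         # normalized syllable time
    return np.polyfit(t, f0, order)            # (order + 1,) coefficients

def restore_syllable(coeffs, n_frames):
    """Re-synthesize a stylized f0 contour from its coefficients."""
    t = np.linspace(0.0, 1.0, n_frames)
    return np.exp(np.polyval(coeffs, t))

# Example: a rising-falling contour compressed to 3 numbers.
contour = [110, 120, 135, 140, 130, 118]
coeffs = stylize_syllable(contour)
print(restore_syllable(coeffs, len(contour)).round(1))
```

Stacking such coefficients over longer temporal domains (e.g., phrase-level contours) gives the joint short-term and long-term description the abstract refers to.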
A Multi-Level Context-Dependent Prosodic Model applied to duration modeling
This paper proposes a multi-level context-dependent prosodic model based on the estimation of prosodic parameters on a set of well-defined linguistic units. Different linguistic units are used to represent different scales of prosodic variation (local and global forms) and thus to estimate, independently at each level, the linguistic factors that can explain the variations of prosodic parameters. This model is applied to the modeling of syllable-based durational parameters on two read speech corpora: laboratory and acted speech. Compared to a syllable-based baseline model, the proposed approach improves performance in terms of the temporal organization of the predicted durations (correlation score) and reduces the model's complexity, while showing comparable performance in terms of relative prediction error. Index Terms: speech synthesis, prosody, multi-level model, context-dependent model.
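A minimal sketch of the multi-level idea under stated assumptions: a global (phrase-level) model captures overall tempo, and a local (syllable-level) model captures residual durations. The scikit-learn regression trees and toy data are stand-ins for the paper's context-dependent models, shown only to make the level decomposition concrete:

```python
# Sketch of a two-level duration model: a phrase-level factor captures
# global tempo while a syllable-level model captures local durations.
# scikit-learn regressors stand in for the paper's context-dependent models.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_syll = rng.normal(size=(500, 6))      # per-syllable linguistic context
X_phrase = rng.normal(size=(500, 3))    # per-phrase context (e.g., length)
y = rng.lognormal(mean=-1.6, sigma=0.3, size=500)   # syllable durations (s)

# Global level: predict a phrase tempo factor, then model the residual
# local variation independently at the syllable level.
phrase_model = DecisionTreeRegressor(max_depth=3).fit(X_phrase, y)
local_target = y / phrase_model.predict(X_phrase)
syll_model = DecisionTreeRegressor(max_depth=5).fit(X_syll, local_target)

pred = phrase_model.predict(X_phrase) * syll_model.predict(X_syll)
print("correlation:", np.corrcoef(y, pred)[0, 1].round(3))
```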
Voice Reenactment with F0 and timing constraints and adversarial learning of conversions
This paper introduces voice reenactment as the task of voice conversion (VC) in which the expressivity of the source speaker is preserved during conversion while the identity of a target speaker is transferred. To do so, an original neural VC architecture is proposed based on sequence-to-sequence voice conversion (S2S-VC) in which the speech prosody of the source speaker is preserved during conversion. First, the S2S-VC architecture is modified so as to synchronize the converted speech with the source speech by means of phonetic duration encoding; second, the decoder is conditioned on the desired sequence of F0 values, and an explicit F0 loss is formulated between the F0 of the source speaker and that of the converted speech. Besides, adversarial learning of conversions is integrated within the S2S-VC architecture so as to exploit the advantages of both the reconstruction of original speech and the conversion of speech with manipulated attributes during training, thereby reducing the inconsistency between training and conversion. An experimental evaluation on the VCTK speech database shows that speech prosody can be efficiently preserved during conversion, and that the proposed adversarial learning consistently improves the conversion and the naturalness of the reenacted speech.
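A sketch of the F0-conditioning and explicit F0 loss described above, assuming PyTorch; the decoder layout and all names are illustrative, not the paper's network. The decoder receives the desired per-frame F0 alongside the content, and a loss ties the F0 of the output back to the source contour:

```python
# Sketch of the F0-constrained decoding idea: the decoder is conditioned
# on the source F0 sequence, and an explicit F0 loss penalizes deviation
# of the converted speech's F0 from the source. PyTorch; names illustrative.
import torch
import torch.nn as nn

class F0ConditionedDecoder(nn.Module):
    def __init__(self, d_content=256, d_out=80):
        super().__init__()
        # +1 input channel carries the desired per-frame F0 value
        self.rnn = nn.GRU(d_content + 1, 256, batch_first=True)
        self.mel_out = nn.Linear(256, d_out)   # converted spectrogram
        self.f0_out = nn.Linear(256, 1)        # F0 re-estimated from output

    def forward(self, content, f0):            # f0: (batch, time, 1)
        h, _ = self.rnn(torch.cat([content, f0], dim=-1))
        return self.mel_out(h), self.f0_out(h)

decoder = F0ConditionedDecoder()
content = torch.randn(2, 120, 256)             # source linguistic content
src_f0 = torch.rand(2, 120, 1)                 # source F0 contour
mel, pred_f0 = decoder(content, src_f0)
# Explicit F0 loss: converted speech must follow the source contour.
f0_loss = nn.functional.l1_loss(pred_f0, src_f0)
```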
Zero-Shot Style Transfer for Gesture Animation driven by Text and Speech using Adversarial Disentanglement of Multimodal Style Encoding
Modeling virtual agents with behavior style is one factor for personalizing human-agent interaction. We propose an efficient yet effective machine learning approach to synthesize gestures driven by prosodic features and text in the style of different speakers, including those unseen during training. Our model performs zero-shot multimodal style transfer driven by multimodal data from the PATS database, which contains videos of various speakers. We view style as pervasive while speaking: it colors the expressivity of communicative behaviors, while speech content is carried by multimodal signals and text. This disentanglement scheme of content and style allows us to directly infer the style embedding even of speakers whose data are not part of the training phase, without requiring any further training or fine-tuning. The first goal of our model is to generate the gestures of a source speaker based on the content of two modalities, audio and text. The second goal is to condition the predicted gestures of the source speaker on the multimodal behavior style embedding of a target speaker. The third goal is to allow zero-shot style transfer for speakers unseen during training, without retraining the model. Our system consists of: (1) a speaker style encoder network that learns to generate a fixed-dimensional speaker style embedding from a target speaker's multimodal data, and (2) a sequence-to-sequence synthesis network that synthesizes gestures based on the content of the input modalities of a source speaker, conditioned on the speaker style embedding. We show that our model can synthesize the gestures of a source speaker and transfer the knowledge of target-speaker style variability to the gesture generation task in a zero-shot setup. We convert the 2D gestures to 3D poses and produce 3D animations. We conduct objective and subjective evaluations to validate our approach and compare it with a baseline.
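A sketch of the zero-shot step under stated assumptions: the trained style encoder maps an unseen target speaker's multimodal clip to a fixed-size embedding in a single forward pass, and that embedding conditions the gesture generator with no retraining or fine-tuning. The tiny stand-in networks below are illustrative only, not the paper's architecture:

```python
# Sketch of zero-shot style transfer: infer the style embedding of an
# *unseen* speaker with one forward pass; no gradient updates needed.
# PyTorch; the stand-in networks and dimensions are assumptions.
import torch
import torch.nn as nn

style_encoder = nn.GRU(128, 64, batch_first=True)   # -> style embedding
generator = nn.Linear(256 + 64, 2 * 49)             # -> 2D pose per frame

@torch.no_grad()
def zero_shot_transfer(source_content, target_clip):
    _, h = style_encoder(target_clip)
    style = h[-1]                                    # (1, 64), inferred only
    style = style.unsqueeze(1).expand(-1, source_content.size(1), -1)
    return generator(torch.cat([source_content, style], dim=-1))

poses = zero_shot_transfer(torch.randn(1, 100, 256),   # source content
                           torch.randn(1, 300, 128))   # unseen target speaker
print(poses.shape)   # (1, 100, 98): 49 2D joints per frame
```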
Vers une modélisation continue de la structure prosodique: le cas des proéminences syllabiques
The aim of this article is to present a tool developed to semi-automatically model the prosodic structure of French. On the basis of a phoneme alignment, our system detects prominent syllables by taking into consideration basic acoustic criteria such as f0, duration, and the presence of pauses. From the measurements thus obtained, the system assigns a degree of prominence to each syllable identified as salient. We then illustrate the results of the analysis of excerpts from the PROSO_FR corpus. More precisely, we compare the prosodic analysis of sentences that could be made with the traditional rules of prosodic phonology with the analysis carried out by our software. We discuss three rules: the right-dominance rule, the stress-clash rule, and the seven-syllable rule.
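A minimal sketch of scoring prominence from the abstract's three criteria (f0, duration, pauses). The z-score combination, weights, and quantization into discrete degrees are illustrative assumptions, not the tool's actual rules:

```python
# Sketch of acoustic prominence scoring from f0, duration, and pauses,
# quantized into discrete prominence degrees per syllable.
# numpy only; weights and thresholds are illustrative assumptions.
import numpy as np

def prominence_degrees(f0_mean, duration, pause_after, n_levels=4):
    """Score each syllable relative to its neighbours; return 0..n_levels-1."""
    f0 = np.asarray(f0_mean, float)
    dur = np.asarray(duration, float)
    # z-scores relative to the utterance: higher f0 / longer = more salient
    score = ((f0 - f0.mean()) / f0.std()
             + (dur - dur.mean()) / dur.std()
             + np.asarray(pause_after, float))   # bonus for a following pause
    # Quantize the continuous score into discrete prominence degrees.
    edges = np.quantile(score, np.linspace(0, 1, n_levels + 1)[1:-1])
    return np.digitize(score, edges)

f0 = [110, 150, 115, 120, 180, 112]              # mean f0 per syllable (Hz)
dur = [0.12, 0.22, 0.10, 0.14, 0.30, 0.11]       # syllable durations (s)
pause = [0, 0, 0, 0, 1, 0]                       # pause after syllable?
print(prominence_degrees(f0, dur, pause))
```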
Comparaison de trois outils de détection automatique de proéminences en français parlé
This paper presents the inner details of three different algorithms for prominence detection. On the basis of a 50-minute corpus made of five speaking styles and manually annotated for prominence, a quantitative evaluation compares the three approaches.
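A sketch of how such a comparison can be run: each detector's binary prominence decisions are scored against the manual annotation. The scikit-learn metrics and the toy arrays below are assumptions for illustration, not the paper's actual detectors, corpus, or evaluation protocol:

```python
# Sketch of comparing prominence detectors against a manual reference
# using per-syllable F1 and Cohen's kappa. Toy data, illustrative only.
import numpy as np
from sklearn.metrics import f1_score, cohen_kappa_score

manual = np.array([1, 0, 0, 1, 0, 1, 0, 0, 1, 0])   # reference annotation
detectors = {                                        # hypothetical outputs
    "tool_a": np.array([1, 0, 0, 1, 0, 0, 0, 0, 1, 0]),
    "tool_b": np.array([1, 1, 0, 1, 0, 1, 0, 0, 0, 0]),
    "tool_c": np.array([0, 0, 0, 1, 0, 1, 0, 1, 1, 0]),
}
for name, pred in detectors.items():
    print(name,
          "F1=%.2f" % f1_score(manual, pred),
          "kappa=%.2f" % cohen_kappa_score(manual, pred))
```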